Recovery in Multicomputers with Finite Error Detection Latency

Authors

  • P. Krishna
  • Nitin H. Vaidya
  • Dhiraj K. Pradhan
Abstract

Computer Science Department, Texas A&M University, College Station, TX 77843-3112

Most research on checkpointing and recovery assumes that a processor halts immediately in response to any internal failure (the fail-stop model). This paper presents a recovery scheme, based on independent checkpointing and message logging, for a multicomputer system whose processors have a nonzero error detection latency. The scheme tolerates bounded error detection latencies, thereby achieving higher fault coverage. Simulation results show that, for typical detection latency values, the recovery overhead is almost independent of the detection latency.
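To make the notion of a bounded detection latency concrete, here is a minimal sketch. It is not the authors' algorithm, only an illustration under stated assumptions: each process keeps timestamped checkpoints, and an error is always detected within `max_latency` time units of its occurrence. Under those assumptions, a checkpoint saved inside that window before detection may already hold corrupted state, so the sketch selects the most recent checkpoint that predates the window. The names `Checkpoint`, `safe_checkpoint`, and `max_latency` are hypothetical, and message logging is not modeled.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Checkpoint:
    taken_at: float  # time at which the checkpoint was saved
    label: str       # hypothetical identifier for the saved state


def safe_checkpoint(
    checkpoints: Sequence[Checkpoint],
    detected_at: float,
    max_latency: float,
) -> Optional[Checkpoint]:
    """Return the latest checkpoint saved strictly before the latency window.

    Assumption (not from the paper): the error occurred no earlier than
    detected_at - max_latency, so any checkpoint saved strictly before
    that instant cannot contain the error's effects.
    """
    cutoff = detected_at - max_latency
    candidates = [c for c in checkpoints if c.taken_at < cutoff]
    # max() with default=None returns None when no checkpoint is old enough.
    return max(candidates, key=lambda c: c.taken_at, default=None)


history = [Checkpoint(0.0, "c0"), Checkpoint(10.0, "c1"), Checkpoint(20.0, "c2")]

# Fail-stop behaviour (zero latency): the newest checkpoint is usable.
fail_stop_choice = safe_checkpoint(history, detected_at=25.0, max_latency=0.0)

# Latency bound of 8: the checkpoint at t=20 falls inside the window [17, 25].
bounded_choice = safe_checkpoint(history, detected_at=25.0, max_latency=8.0)
```

With a zero bound the sketch reduces to the fail-stop case and picks the newest checkpoint. A nonzero bound forces a rollback past any checkpoint that might have been saved after the error occurred.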


Related resources

Error Recovery Mechanism using Dynamic Partial Reconfiguration

In this paper, an error recovery mechanism for SRAM-based FPGA systems is presented. Previous recovery methods employ processor cores as a reconfiguration controller, consuming a notable amount of device resources and introducing additional error detection and recovery latency. The described mechanism is controlled by a finite state machine architecture providing small hardware overhead and short r...


A Software-Based Hardware Fault Tolerance Scheme for Multicomputers

A hardware fault tolerance scheme for large multicomputers executing time-consuming non-interactive applications is described. Error detection and recovery are done mostly by software with little hardware support. The scheme is based on simultaneous execution of identical copies of the application on two subnetworks of the system. Normal system operation is periodically suspended and the logica...


Coordinated Checkpointing-Rollback Error Recovery for Distributed Shared Memory Multicomputers

Most recovery schemes that have been proposed for Distributed Shared Memory (DSM) systems require unnecessarily high checkpointing frequency and checkpoint traffic, which are sensitive to the frequency of interprocess communication in the applications. For message-passing systems, low overhead error recovery based on coordinated checkpointing allows the frequency of checkpointing to be determin...


Application-Transparent Process-Level Error Recovery for Multicomputers

A multicomputer system consisting of hundreds of processors interconnected by point-to-point links can achieve high performance for many important applications. We propose a new application-transparent, process-level, distributed error recovery scheme for multicomputers. Checkpointing is initiated by timers at intervals determined by the needs of the application. Checkpointing and recovery invo...


Error Recovery in Multicomputers Using Global Checkpoints

Periodic checkpointing of the entire system state and rolling back to the last checkpoint when an error is detected is proposed as a basis for error recovery on a VLSI multicomputer executing non-interactive applications. Detailed algorithms for saving the checkpoints, distributing diagnostic information, and restoring a valid system state are presented. This approach places no restrictions on ...



Publication date: 1994